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Abstract 

We consider Bayesian inference in sequential latent variable models 
in general, and in nonlinear state space models in particular (i.e., state 
smoothing). We work with sequential Monte Carlo (SMC) algorithms, 
which provide a powerful inference framework for addressing this prob¬ 
lem. However, for certain challenging and common model classes the 
state-of-the-art algorithms still struggle. The work is motivated in partic¬ 
ular by two such model classes: (i) models where the state transition ker¬ 
nel is (nearly) degenerate, i.e. (nearly) concentrated on a low-dimensional 
manifold, and (ii) models where point-wise evaluation of the state tran¬ 
sition density is intractable. Both types of models arise in many appli¬ 
cations of interest, including tracking, epidemiology, and econometrics. 
The difficulties with these types of models is that they essentially rule 
out forward-backward-based methods, which are known to be of great 
practical importance, not least to construct computationally efficient par¬ 
ticle Markov chain Monte Carlo (PMCMC) algorithms. To alleviate this, 
we propose a “particle rejuvenation” technique to enable the use of the 
forward-backward strategy for (nearly) degenerate models and, by exten¬ 
sion, for intractable models. We derive the proposed method specifically 
within the context of PMCMC, but we emphasise that it is applicable to 
any forward-backward-based Monte Carlo method. 


1 Problem formulation 

State space models (SSMs) are widely used for modelling time series and dy¬ 
namical systems. A general, discrete-time SSM can be written as 

Xt\xt- 1 ^ f{xt\xt-i), (la) 

yt\xtg{yt\xt), (lb) 
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where Xt G X is the latent state, y* G Y is the observation (both at time t) and 
/(•) and g{-) are probability density functions (PDFs) encoding the state transi¬ 
tion and the observation likelihood, respectively. The initial state is distributed 
according to Xi ~ g{xi). 

Statistical inference in SSMs typically involves computation of the smooth¬ 
ing distribution, that is, the posterior distribution of a sequence of state vari¬ 
ables Xt '■= {xi, Xt) G conditionally on a sequence of observations 
Yt ■= {yi, • ■ ■, Vt) G Y^. The smoothing distribution plays a key role both 
for offline (batch) state inference and for system identification via data aug¬ 
mentation methods, such as expectation maximisation m and Gibbs sampling 

[33]. 

The motivation for the present work comes from two particularly challenging 
classes of SSMs: 

(Ml) If the state transition kernel /(•) of the system puts all probability mass 
on some low-dimensional manifold, we say that the transition is degenerate (for 
simplicity we use “probability density notation” even in the degenerate case). 
Degenerate transition kernels arise, e.g., if the dynamical evolution is modeled 
using additive process noise with a rank-deficient covariance matrix. Models of 
this type are common in certain application areas, e.g., navigation and tracking; 
see for several examples. Likewise, if /(•) is concentrated around a low¬ 
dimensional manifold (i.e., the transition is highly informative, or the process 
noise is small) we say that the transition is nearly degenerate. 

(M2) If the state transition density function /(•) is not available on closed 
form, the transition is said to be intractable. The typical scenario is that /(•) 
is a regular (non-degenerate) PDF which it is possible to simulate from, but 
which nevertheless is intractable. At first, this scenario might seem contrived, 
but it is in fact quite common in practice. In particular, whenever the dynamical 
function is defined implicitly by some computer program or black-box simulator, 
or as a “complicated” nonlinear transformation of a known noise input, it is 
typically not possible to explicitly write down the corresponding transition PDF; 
see, e.g., [31 la for examples. 

The main difficulty in performing inference for model classes (Ml) and 
(M2) lies in that the so-called baekward kernel of the model (see, e.g., [30]), 

p{xt I xt+i,yi-t) oc fixt+i I xt)p{xt I yi,t), (2) 

will also be (nearly) degenerate or intractable, respectively. This is problematic 
since many state-of-the-art methods rely on the backward kernel for inference; 
see, e.g., [M1EI1I13- In particular, the backward kernel is used to implement 
the well-known forward-backward smoothing strategy. We will come back to 
this in the subsequent sections when we discuss the details of these inference 
methods and how we propose an extension to the methodology geared toward 
this issue. 

The main contribution of this paper is constituted by a construction allow¬ 
ing us to replace the simulation from the (problematic) backward kernel, with 
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simulation from a joint density over a subset of the future state variables. Im¬ 
portantly, this joint density will typically have more favourable properties than 
the backward kernel itself, whenever the latter is (nearly) degenerate. Simulat¬ 
ing from the joint density results in a rejuvenation of the state values, which 
intuitively can be understood as a bridging between past and future state vari¬ 
ables. Furthermore, to extend the scope of this technique we propose a nearly 
degenerate approximation of an intractable transition model. Using this approx¬ 
imation we can thus use the proposed particle rejuvenation strategy to perform 
inference also in models with intractable transitions. 


2 Background, methodology, and related work 

2.1 Computational inference 

The strong assumptions of linearity and Gaussianity that were originally invoked 
for state space inference have been significantly weakened by the development 
of computational statistical methods. Among these, Markov chain Monte Carlo 
(MCMC) and sequential Monte Carlo (SMC, a.k.a. particle filter) methods play 
prominent roles. MCMC methods (see e.g., [HUH) are based on simulating a 
Markov chain which admits the target distribution of interest—here, the joint 
smoothing distribution—as its unique stationary distribution. SMC methods, 
on the other hand (see, e.g., [muin]), use a combination of sequential impor¬ 
tance sampling m and resampling [15] to approximate a sequence of probability 
distributions defined on a sequence of measurable spaces of increasing dimen¬ 
sion. This is accomplished by approximating each target distribution by an 
empirical point-mass distribution based on a collection of random samples, re¬ 
ferred to as particles, with corresponding non-negative importanee weights. For 
instance, the target distributions of an SMC sampler can comprise the smooth¬ 
ing distributions for an SSM for t = 1, ..., T and, indeed, SMC methods have 
emerged as a key tool for approximating the flow of smoothing distributions for 
general SSMs. 

In addition to these methods, we have in recent years seen much interest in 
the combination of SMC and MCMC in so-called particle MCMC (PMCMC) 
methods [5|. These methods are based on an auxiliary variable construction 
which opens up the use of SMC (or variants thereof) to construct efficient, 
high-dimensional MCMC transition kernels. The introduction of PMCMC has 
spurred intensive research in the community spanning methodological (9] jHI] |48l 
15^ . theoretical imiiiiKiniiiTKii, and applied [Ml [Ml nano] work. 

In this paper we consider in particular the particle Gibbs (PC) algorithm, 
introduced by Andrieu et al. l^. The PC algorithm relies on running a modified 
particle filter, in which one particle trajectory is set deterministically according 
to an a priori specified reference trajectory (see Section |^. After a complete 
run of the SMC algorithm, a new trajectory is obtained by selecting one of the 
particle trajectories with probabilities given by their importance weights. This 
results in a Markov kernel on the space of trajectories X^. Interestingly, the 
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conditioning on a reference trajectory ensures that the limiting distribution of 
this Markov kernel is exactly the target distribution of the sampler (e.g., the 
joint smoothing distribution) for any number of particles N > 2 used in the 
underlying SMC algorithm. The PG algorithm can be interpreted as a Gibbs 
sampler for the extended model where the random variables generated by the 
SMG sampler are treated as auxiliary variables. 

Like any Gibbs sampler, PG has the advantage over Metropolis-Hastings 
of not requiring an accept/reject stage. However, the resulting chain is still 
liable to mix (very) slowly if the particle hlter suffers from path-space degener¬ 
acy [iniisn]. Unfortunately, path degeneracy is inevitable for high-dimensional 
(large T) problems, which significantly reduces the applicability of PG. This 
problem has been addressed in a generic setting by adding additional sam¬ 
pling steps to the PG sampler, either during the filtering stage, known as par¬ 
ticle Gibbs with ancestor sampling (PGAS) [3T], or in an additional backward 
sweep, known as particle Gibbs with backward simulation (PGBS) [I51H7] . The 
improvement arises from sampling new values for individual particle ancestor 
indexes, and thus allowing the reference trajectory to be updated gradually 
which can mitigate the effects of path degeneracy. It has been found that this 
can vastly improve mixing, making the method much more robust to a small 
number of particles as well as growth in the size of the data. 

2.2 Tackling (nearly) degenerate or intractable transitions 

A problem with PGAS and PGBS, however, is that they rely on a particle 
approximation of the backward kernel for updating the ancestry of the particles. 
For an SSM the backward kernel is given by ([^, which implies that if /(•) is 
(nearly) degenerate or intractable, then so is the backward kernel. As an effect, 
the probability of sampling any change in the particle ancestry is low. This 
largely removes the effect of the extra sampling steps introduced in order to 
reduce path degeneracy. In fact, if the transition is truly degenerate, then 
the probability of updating the ancestry will be exactly zero. Intuitively, the 
problem is that the only state history consistent with a particular “future state” 
is that from which the future state was originally generated. 

To mitigate this effect we propose to use a procedure which we call particle 
rejuvenation. A specihc instance of this technique has previously been used 
together with forward/backward particle smoothers [5] —when sampling an an¬ 
cestor index they simultaneously sample a new value for the associated state. 
In the preliminary work [7] we investigated the effect of this approach on the 
PGBS sampler. This opened up for steering the potential state histories towards 
the fixed future, consequently increasing the probability of changing the ances¬ 
try and thus improving the mixing of the Markov chain. Independently, Garter 
et al. have derived essentially the same method. However, they introduce 
a different extended target distribution in order to justify this addition, which 
necessitates changes to the underlying SMG sampler, whereas our developments 
allows us to use the standard PMCMG construction. 

In the present work we extend the method from HI 11 by considering a more 
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generic setting, in which we propose to rejuvenate any “subset” of the reference 
trajectory along with the particle ancestry (Section]^. This provides additional 
flexibility over previous approaches. As we illustrate in Section]^ this increased 
flexibility is necessary in order to address the challenges associated with sev¬ 
eral models of interest containing (nearly) degenerate transitions (type (Ml)). 
Furthermore, in Section we show how an intractable transition can be ap¬ 
proximated with a nearly degenerate one. Employing the particle rejuvenation 
strategy to this approximation results in inference methods applicable to models 
of type (M2) where /(•), and thus the backward kernel ([^, is intractable. Here, 
again, the increased flexibility provided by the generalised particle rejuvenation 
strategy is key. We work specifically with the PGAS algorithm, but all the 
proposed modifications are also applicable to PGBS and, in fact, to any other 
backward-simulation-based method (see [30]). 

3 Particle Gibbs with Ancestor Sampling 

While SSMs comprise one of the main motivating factors for the development 
of SMG and PMCMC (and we shall return to these models in the sequel), these 
methods can be applied more generally, see, e.g., [3130]. For this reason we will 
carry out the derivation in a more general setting. 

We start by reviewing the PGAS algorithm by Lindsten et al. m- Gonsider 
the problem of sampling from a possibly high-dimensional probability distribu¬ 
tion. As before, we write Xt := {xi, ..., xt) G for the random variables of 
the model. Let 7 t(At) for t = 1, ..., T be a sequence of unnormalised densities 
on X*, which we assume can be evaluated point-wise. Let be the cor¬ 

responding normalised PDFs. The central object of interest is assumed to be 
the final PDF in the sequence: ^t{Xt)- For an SSM we would typically have 
7 t(At) = p{Xt I Yt) and = p{Xt,Yt). 

The PGAS algorithm [3T| is a procedure for constructing a Markov ker¬ 
nel on which admits Jt{Xt) as its unique stationary distribution. Gonse- 
quently, by simulating a Markov chain with transition probability given by the, 
so-called, PGAS kernel, we will (after an initial transient phase) obtain sam¬ 
ples distributed according to ^t{Xt)- The PGAS kernel can thus readily be 
used in MCMG and related Monte Garlo methods based on Markov kernels. As 
mentioned above, PGAS is an instance of PMCMC. Specifically, the construc¬ 
tion of the PGAS kernel relies on (a variant of) an SMC sampler, targeting the 
sequence of intermediate distributions 7 t(At) for t = 1, ... ,T. 

For later reference we introduce some additional notation. First, we will 
make frequent use of the notation already exemplified above, where capital 
letters are used for sequences of variables,_e.g., by writing Xt for the “past” 
history of the state sequence. Similarly, Xt+i := Xt \ Xt = (xt+i, ..., Xt) 
denotes the “future” state sequence. Here, we have used set notation to refer to 
a collection of latent variables and we will make frequent use also of this notation 
in the sequel. In particular, we will write S C Xt for an arbitrary subset 
of the latent random variables of the model, which could refer to individual 
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Figure 1: Illustration of subset notation S C Xt- Here, X = and each 
component of the vector xt is illustrated by a circle. The gray disks illustrate a 
subset S C Xg, consisting of S = {xj^t : j = 1, 2, 3 and t = 3, 4, 5, 6 }. 


components of Xt, say, when xt is a vector (see Section . For clarity, we 
exemplify this usage of set notation in Figure 

The PGAS algorithm is reminiscent of a standard SMC sampler, but with 
the important difference that one particle at each iteration is specified a priori. 
These particles, denoted as (a;(, ..., x'j,) =: X'j. and with corresponding particle 
indexes ( 61 , ..., hx) ='■ Bt, serve as a reference for the sampler, as detailed 
below. 

As in a standard SMC sampler, we approximate the sequence of target 
densities jt{Xt) for t = 1, ..., T by collections of weighted particles. Let 
{Xl_.^^,wl_i}^i be a particle system approximating ^t-i{Xt_i) by the em¬ 
pirical distribution. 




N 


'^t-1 


i=i 2Zi=iwi-i 




(3) 


Here, 5z{x) is a point mass located at z. To propagate this particle system 
to time t, we introduce the auxiliary variables referred to as ancestor 

indexes, encoding the genealogy of the particle system. More precisely, xl is 
generated by first sampling its ancestor, 

lP(«t =i) oc W-i> j = l,...,N. ( 4 ) 


Then, x^ is drawn from some proposal distribution. 


xl^r,i-\X:i,). ( 5 ) 

When we write XI we refer to the ancestral path of particle xl- That is, the 
particle trajectory is defined recursively as 

Xl:=ix:i„xi). ( 6 ) 


In this formulation the resampling step is implicit and corresponds to sampling 
the ancestor indexes. Finally, the particle is assigned a new importance weight: 
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wl = u}t{Xl) where the weight function is given by 


= -■ (7) 

7t_i(At_i)rt(a:t | Xt-i) 

In a standard SMC sampler, the procedure above is repeated for each i = 
1, ..., to generate N particles at time t. However, as mentioned above, the 
PGAS procedure relies on keeping one particle fixed at each iteration in order to 
obtain the correct limiting distribution of the resulting MCMC kernel. This is 
accomplished by simulating according to Q and ([^ only for i G {1, ..., N}\bt. 
The hnal particle is set deterministically according to the reference: Xj* = x^. 

To be able to construct the particle trajectory as in (§ , the reference 
particle has to be associated with an ancestor at time t — 1. This is done by 
ancestor sampling', that is, we simulate randomly a value for the corresponding 
ancestor index Oj*. Lindsten et al. m derive the ancestor sampling distribution, 
and show that a!l* should be simulated according to, 


= 


i) oc wl_^ 


7r(A-_iUX') 

lt-i{Xl_^) 


i = (8) 


Note that the expression depends on the complete “future” reference path X[ = 
{Xf, ..., x'rp). In the above, Xl_i U X^ refers to the complete path formed by 
concatenating the two partial trajectories. 

The expression above can be understood as an application of Bayes’ theo¬ 
rem; the weight is the prior probability of particle Xl_i and the density 
ratio corresponds to the likelihood of “observing” X^ given Xl_-^^. The actual 
motivation for the ancestor sampling distribution Q, however, relies on a col¬ 
lapsing argument |3Ij . We will review this idea in the subsequent section where 
we propose a generalisation of the technique to increase the flexibility of the 
PGAS algorithm. 

Finally, after a complete pass of the modified SMC procedure outlined above 
for t = I, T, a new trajectory X^ is sampled by selecting among the particle 
trajectories with probability given by their corresponding importance weights. 
That is, we sample k with P(/c = z) oc wlp, i = 1, ..., N and return X^ = A|.. 
We summarise the PGAS procedure in Algorithm [l] Note that the algorithm 
stochastically simulates the trajectory X^ conditionally on the reference trajec¬ 
tory X'rp, thus implicitly defining a Markov kernel on X^. 

Remark 1 . As pointed out in [iniEi], the indexes Bt are nuisance variables that 
are unnecessary from a practical point of view, i.e., it is possible to set Bt to any 
arbitrary convenient sequence, e.g. Bt = {N, ..., N). We use this convention 
in Algorithms and However, explicit reference to the index variables will 
simplify the derivation of the algorithm. 

Remark 2. PGAS is a variation of the PG sampler by Andrieu et al. [3]. Algo¬ 
rithmically, the only difference between the methods lies in the ancestor sam¬ 
pling step (|^, where, in the original PG sampler we would simply set = N 
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Algorithm 1 PGAS Markov kernel m 
Require: Reference trajectory G X^. 

1: Set = x{. 

2: Draw xl ri(-) for z = 1, ..., N — 1. 

3: Set icj = /ri{x\) for i = 1, ..., N. 

4: for t = 2 to T do 

5: Simulate (aj, xl) as in Q [^for z = 1, ..., A — 1. 

6: Simulate according to^^. 

7: Set xf = x'f. 

8: Set Xi = ccj) for t = 1, ..., A. 

9: Set wl = u}t{Xl) for i = 1, ..., A. 

10: end for 

11: Draw k with P(A: = z) oc 
12: return X^ = X^. 


(deterministically). While this is a small modification, it can have a very large 
impact on the convergence speed of the method, as discussed and illustrated in 
|31j . Informally, the reason for this is that the path degeneracy of SMC samplers 
will cause the PC sampler to degenerate toward the reference trajectory. That 
is, with high probability x* = x'^ for any s <^T, effectively causing the sampler 
to be stuck at its initial value in certain parts of the state space X^. Ances¬ 
tor sampling mitigates this issue by assigning a new ancestry of the reference 
trajectory via Q. Path degeneracy will still occur, but the sampler tends to 
degenerate toward something different than the reference trajectory (a;* ^ x'^ 
with non-negligible probability) enabling much more efficient exploration of the 
state space. 

4 Particle rejuvenation for the PGAS kernel 

We now turn to the new procedure: a particle rejuvenation strategy for PGAS. 
The idea is to update the ancestor index a!l* jointly with some subset St C X^ 
of the reference trajectory. The intuitive motivation is that by “loosening up” 
the reference trajectory, we get more freedom of changing its ancestry. Before 
we continue with the derivation of the particle rejuvenation strategy, however, 
we shall consider a simple numerical example to motivate the development. 

4.1 An illustrative example 

Suppose we wish to track the motion of an object in 3D space from noisy mea¬ 
surements of its position. For the transition model fe{xt \ Xt-i) we assume near 
constant velocity motion [29j , and the observation model is noisy measurements 
of bearing, elevation and range. Here we have also introduced an unknown sys¬ 
tem parameter 9 corresponding to the scale factor on the transition covariance 






matrix, which characterises the target manoeuvrability. Based on a batch of 
T = 100 observations Yr, we wish to learn the unknown parameter 6 by Gibbs 
sampling, iteratively simulating from p{9\Xt^Yt) and p[Xt \ 9,Yt). (See [7] 
for more details on the model and experimental setup.) 

This model can be problematic for PGAS if the scale parameter 9 is small, 
since this implies a highly informative, i.e., nearly degenerate, transition model. 
To be more precise, for an SSM, with the unnormalised target density being 
le,t{Xt) = p{Xt^ Yt\9), it follows that the ancestor sampling distribution in (|^ 
simplifies to 


P(«t‘ = ^) 




1 = 1, (9) 


Now, if 9 —the process noise—is small, then fe{xt \ Xt-i) is largely concentrated 
on a single point. Hence, the distribution in will also be concentrated on 
one value and there is little freedom in changing the ancestry of x[ at time t—1. 
The result is that the effect of ancestor sampling is diminished. 

The idea with particle rejuvenation is that simultaneously sampling a new 
state x^, for instance, jointly with the ancestor index opens up for bridging be¬ 
tween the states x\_i and This leads to a substantially higher probability 

of updating the ancestor indexes during each iteration, and hence faster mixing. 

Using this rejuvenation strategy roughly doubles the computation required 
to execute each iteration of PGAS. In simulations, the median factor of im¬ 
provement in the probability of accepting an ancestor change at each time step 
is 2.4 (inter-quartile range 1.9-2.8). By contrast, simply doubling the number 
of particles results in a median factor of improvement of only 1.5 (inter-quartile 
range 1.4-1.6). The difference is even more pronounced in terms of autocor¬ 
relation, as shown in Figure (We also ran the basic PG algorithm, without 
ancestor sampling, but due to path degeneracy the method did not converge 
and the results are therefore not reported here.) 


4.2 Extended target distribution 

The formal motivation for the validity of PMGMG algorithms is based on an 
auxiliary variables argument. More precisely, Andrieu et al. [S] introduce an 
extended target distribution which is defined on the space of all the random 
variables generated by the run of an SMG algorithm. Let 

xt := ..., xf}, and at ■= {a], ■ ■ ■, a^}, 

denote the particles and ancestor (resampling) indexes generated at time t, 
respectively. We also write Xt := {xi, ..., x*} and At := {a 2 , ..., at}. Fur¬ 
thermore, let k be the index of one specific reference trajectory. To make the 
particle indexes of the reference trajectory X^ explicit we define recursively: 
br = k and bt = for t < T. Hence, bt corresponds to the index of the 
reference particle at time t, obtained by tracing the ancestry of x^. If follows 
that Ap = {x\^ , ■. •, x'^ ). 
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Figure 2: Autocorrelation functions for the parameter 9 (scale factor of tran¬ 
sition covariance). Averages over 5 runs of PGAS without rejuvenation using 
N = 100 particles (solid red) and N = 200 particles (dashed blue), and with 
rejuvenation using N = 100 particles (dotted green). 


The extended target distribution for PMCMC samplers is then given by 
(see [5]) 

N ^ T ( N 

H’-.w) n n 

2=1 ) i =2 L 2=1 

i^bi i^bt 

( 10 ) 

A key property of this distribution is that it admits the original target distri¬ 
bution 77 - as a marginal. That is, if (X 7 -,A 7 -,fc) are distributed according to 
TT^, then the marginal distribution of is 77 -. This implies that tt^ can be 
used in place of 7 t in an MCMC scheme; this is the technique used by PMCMC 
samplers. 

4.3 Partial collapsing and particle rejuvenation 

In particular, the PGAS algorithm that we reviewed in Section [^corresponds to 
a partially collapsed Gibbs sampler for the extended target distribution tt^ . The 
complete Gibbs sweep corresponding to Algorithm is given in the appendix. 
Here, however, we will focus on the ancestor sampling step (j^. As mentioned 
in Remark [^ this step is very useful for improving the mixing of the PG algo¬ 
rithm. However, as described above, for certain classes of models the likelihood 
of the “future” reference path X[ can be very low under alternative histories 
{A(_i})^i. The PDF ratio in expression Q will thus cause the ancestor sam¬ 
pling distribution to be highly concentrated on j = bt-i, i.e., P(a( = bt-i) « 1 . 


tt (Xt, At, k ) ■- 


N i 
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In particular, this is true for nearly degenerate SSMs; recall model class (Ml). 
In fact, for truly degenerate models it may be that P(a(* = &t_i) = 1, which 
implies that the ancestor sampling step has no effect and PGAS is reduced to 
the basic PG scheme. 

Observe that, while the PGAS algorithm attempts to update the ancestry 
of the reference particles X'j,, it does not update the values of the particles 
themselves. This observation can be used to mitigate the aforementioned short¬ 
coming of the algorithm, as we will now illustrate. The proposed modification 
is conceptually simple, but its practical implications for improving the mixing 
of the PGAS algorithm can be quite substantial for many models of interest. 

The idea is to simultaneously update the ancestor index together with 
a part of the future reference trajectory A(. This results in an increased flexi¬ 
bility of bridging the future reference path with an alternative history, thereby 
increasing the probability of changing its ancestry. By “a part of”, we here refere 
to any collection of random variables C Xt (see Figure]^ for an illustration). 
Typically, the larger this subset is, the larger will the increased flexibility be. 
However, this has to be traded off with the difficulty of updating St, which can 
be substantial if 2f is overly high-dimensional. 

For notational convenience, let St = At \ St. The AS step of the PGAS 
algorithm is then replaced by a step where we simulate (a^*, St) jointly from the 
conditional distribution on {1, ..., N} x range(St): 

^^^(a^,St|Xt_l,At_l,St,Ht), (11) 


where Bt := (6t, ■ ■ ■, br)- Note that this is a so-called partially collapsed Gibbs 
move, since not all the non-simulated variables of the model are conditioned 
upon. Specifically, we have excluded all the future particles and ancestor in¬ 
dexes, except for those corresponding to the reference path. The justification 
for this is that we are sampling conceptually from the distribution 

TT^ (a^, S„ {X* \ AJ, {A* \ BJ I X*_i, At_i, S*, B,), (12) 


which is a standard Gibbs update for the extended target distribution (10). 
However, no consecutive operation will depend on the variables Xj \ Xt and 
At \ Bt, which has the implication that these variables need not be generated 
at all; see Hu- 

Following [31] we obtain an expression for the conditional distribution (111 
which much resembles the original ancestor sampling distribution (§, namely. 


-N/„bt I Xt_i, At_l, 


'0 




Bt) 


iT{XtU U St u s;) 

t-1 fat 

7t-i(^t“-i) 


(13) 


Note, however, that this is a distribution on {1, ..., N} x range(St), and sim¬ 
ulating from this distribution allows us to update the ancestor index Oj* jointly 
with a part of the reference trajectory St. Additionally, at time t = 1 we can 
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update by simulating from the conditional distribution 






Bi) cx 7 t(-i U E[). 


(14) 


In most cases, exact simulation from (13) or is not possible. However, this 


issue can be dealt with by instead simulating from some MCMC kernels leaving 
these distributions invariant, resulting in a standard combination of MCMC 
samplers, see e.g. [IS]. Hence, let Kt denote a Markov kernel on {1, ..., N} x 
range(E!t) which leaves the distribution (13) invariant (sampling from ( [l4| at 
time t = 1 follows analogously). The proposed modified PGAS method is then 
given by Algorithm 


Algorithm 2 PGAS with particle rejuvenation 
Require: Reference trajectory G X^. 

1: Simulate S* ^ •) and update X!j. accordingly: X'j. G- 

2 : Set = x'l- 

3: Draw ~ ri(-) for i = 1, ..., N — 1. 

4: Set w{ = ^i{x\)/ri{x\) for i = 1, ..., N. 

5: for t = 2 to T do 

6: Simulate (a(, xj) as in (|^ [^ for z = 1, ..., A — 1. 

7: Simulate {a ^~ Kt{{N,E^), ■) and update X!p accordingly: ^ 

8: Set X^ = Xf. 

9: Set Xi = (A“y, xj) for i = 1, ..., A. 

10: Set wl = uJt(Xl) for z = 1, ..., A. 

11: end for 

12: Draw k with P(/c = z) oc w^. 

13: return X^ = X^. 


Remark 3. The previous particle rejuvenation strategies proposed indepen¬ 
dently by us |7] and Carter et al. [^ correspond to the special case obtained 
by setting Et = Xt- However, as we shall see in Sections and [^ this is insuf¬ 
ficient in many cases. In particular, to address the challenges associated with 
some degenerate models (Ml) and models with intractable transitions (M2), 
we need additional flexibility in selecting Et- Furthermore, a difference between 
the current derivation and the one presented by Carter et al. [5] is that they 
do not make use of the technique of partial collapsing. As a consequence, they 
are forced to re-define the extended target distribution ( |Io| ), resulting in (un¬ 
necessary) modifications of the SMC scheme and it implies that their approach 
is only applicable when using an explicit backward pass (as in PGBS). 

Remark 4. While ergodicity of the kernels Kt is of practical importance in 
order to obtain a large performance improvement from the ’ancestor sampling 
& particle rejuvenation’ strategy, it is not needed to guarantee ergodicity of 
the overall sampling scheme. In particular, if Kt{{at\Et), ■) = 6, b^. .(•), the 

,^t) 

proposed method reduces to the original PG algorithm by [3| which is known 
to be uniformly geometrically ergodic under weak assumptions |32||4]. 
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Below, we present two specific techniques for designing the kernels Kt that 
can be useful in the present context. 


Metropolis-Hastings (MH) We can target using MH. From current 
values (ttf, S(), we can propose new values (a^, by drawing from, 

(15) 

where is a set of proposal weights for the ancestor index and (pt is 

a proposal density for rejuvenating the reference particles The resulting 
acceptance probability is then. 








= min < 1 






I V^t 77' 


0 u;Ei7t(XEi U U S() 7t-i(^Ei) 


(16) 


Conditional importance sampling Given that we are working within the 
PG framework, a more natural approach might be to use a conditional impor¬ 
tance sampling (GIS) Markov kernel. This can be viewed simply as an instance 
of the PG kernel applied to a single time step. Consider an importance sampling 
proposal distribution for (at,St) (cf. 

(17) 

Given the current values (a(,2(), a Markov kernel with as its stationary 
distribution can be constructed as in Algorithm The validity of this approach 
follows as a special case of the derivation of the PG kernel [3]. 


Algorithm 3 Conditional importance sampling 
Require: Current state (a(,5(). 

1: Set = a[ and = 5(. 

2: Draw (al, 5)) from ( [T7| ) for i = 1, ..., IV — 1. 

3: Set 

^fli7r(AfliU5)U5() 

Wt-l = — -- 

for i = 1, ..., N. 

4: Draw i with P(£ = i) oc u5)_i. 

5: return «,S*) = (af,5f). 
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Remark 5. We can also define the kernel Kt to be composed of m, say, iterates 
of the MH or the CIS kernel to improve its mixing speed (at the cost of an 
TO-fold increase in the computational cost of simulating from Kt). Indeed, any 
standard combination of MCMC kernels (see, e.g., [33]) targeting ( |T^ will result 
in a valid definition of Kt- 

4.4 Convergence properties 

Existing convergence analysis for particle Gibbs algorithms isainiio! can be 
extended also to the proposed modified EGAS procedure of AlgorithmHere, 
we restate the uniform ergodicity result for the PGAS algorithm presented by 
[5T] . adopted to the current settings. We write || • ||oo and Dtv for the supremum 
norm and the total variation distance, respectively. 

Theorem 1. Assume that there exists a constant k < oo such that HwtHoo < k 
for any t G {I, ..., T}. Then, for any N > 2 there exist constants Rn < oo 
and Pm G [0,1) such that 

iDTv(Law(AT[fc]),7r) < VA^ e X^, 

where the Markov chain {XT[k]}k>o is generated by iterating Algorithm^ with 
initial state ^^[O] = X'j.. 

The proof follows analogously to the proof of m Theorem 3] and is omitted 
for brevity. 


5 Degenerate and nearly degenerate models 

The method proposed in Algorithmj^can be used for a general sequence of target 
distributions {jt{Xt)}Jfi. We now turn our attention explicitly to (nearly) 
degenerate SSMs as described in (Ml) and discuss how Algorithm can be 
used for these models. 


5.1 Ancestor sampling for nearly degenerate models 

Gonsider again inference for the SSM given by 0 , with the unnormalised 
target density jt{Xt) = p{Xt,Yt). We follow the convention used in Algo¬ 
rithms!^ and that the reference particle is always placed on the IVth position. 
It follows that the ancestor sampling distribution ([^ is given by 


P(af = ^) 


wl if{x't\xl T^) 


* = 1, ...,7V. (18) 


If the state process noise is small, i.e. the transition density /(•) is nearly degen¬ 
erate, then this probability distribution can be highly concentrated on i = N, 
effectively removing the effect of ancestor sampling; we experienced this effect 
in the motivating example in Section [4.1| 
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To cope with this issue, one option is to make a partial collapse over a 
subset of the future state variables. That is, we let Sj = (xt, .. •, x^t) where 
Kt = min{T, t + i — 1} for some fixed length i. It follows that the ratio of the 
unnormalised target densities appearing in (131 can be written as 


N, = p{Xt,Yt\xt-i) oc /(a;«,+i I a;«J i fj| a;s_i)5f(?/s | I . 

(19) 

Hence, the target distribution for the modified ancestor sampling step, with 
particle rejuvenation, is defined on {1, ..., N} x and the corresponding 

PDF of (at,St) is proportional to 


f{x,\xs-i)g{y,\xs)\ f{xt\xtL^)g{yt\xt). ( 20 ) 




Simulating from this distribution can be done, e.g., by using one of the MCMC 
kernels introduced in Section 4.3 The benefit of doing this is that by rejuve¬ 
nating the reference trajectory over the variables {xt, ..., x^) we are able to 
bridge between xl_i and thereby increasing the probability of changing 

the ancestry for the reference path. 

We used this approach with i = 1 and a CIS Markov kernel for state reju¬ 
venation for the target tracking model in Section |4.1| Below, we present two 
other example applications where the aforementioned technique can be useful. 


5.2 Example: Euler-Maruyama discretisation of SDEs 

Consider a continuous-time state space model with hidden state {Z(T)}r>o, 
represented by the following stochastic differential equation (SDE) 

dZ^. = g{Zr)dT + a{Zr)dWT, ( 21 ) 

where Wr denotes a Wiener process. The process is observed indirectly through 
the observations (j/i, ..., yr), obtained at time points (ti, ..., tt), where yt 
g{yt I Zrt). A simple approach to enable inference in this model is to consider a 
time discretisation of the continuous process, using an Euler-Maruyama scheme, 
after which standard discrete-time inference techniques can be used. Eor sim¬ 
plicity, assume that the observations are equidistant, At := — Tt-i, and that 

we sample the process m times for each observation. 

Let the discrete-time state at time Tt consist of as well as the m — 1 
intermediate states, i.e. 

Xt ■= {xj.i ■■■ xj,m-i ^Tt) ( 22 ) 

where Xty = Zj^_^+jAT for j = 1, ..., m— 1. When using PGAS for this model, 
a problem is that, while increasing m makes the discretisation more accurate, 
it will also make the transition kernel of the latent process more degenerate. 
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However, this issue can be mitigated by rejuvenating the intermediate state 
variables, i.e. the state variables in between observation time points. Hence, we 
set 



(23) 


Similarly to above, it follows that the ratio of unnormalised target densities is 
given by 


1t{Xt) 


OC f{xt I Xt-l) 


(24) 


7t-i(^*-i) 



(25) 


where, by the Euler-Maruyama discretisation, 

p{Zr+AT I Zr) « NiZr + /r(Z^)Ar, a'^{Zr)AT). 


(26) 


is given by 


Hence, the unnormalised target PDF in 



(27) 


To obtain an efficient MCMC proposal distribution for this PDF, we can use 
one of the methods proposed by [18] which are based on a tractable diffusion 
bridges between and Z'^^. 

5.3 Example: Degenerate Gaussian transition 

Consider a model with a Gaussian transition, but a possibly nonlinear/non- 
Gaussian observation 


(28a) 

(28b) 


xt+i = Axt + Fvt+i 

yt ~ 9{yt\xt), 


with vt ~ Af(0,ld)- Furthermore, assume that dim(a;() = n and that rank(F) < 
n. This implies that the transition kernel of the linear Gaussian state process is 
degenerate. Models on this form are common in certain application areas, e.g., 
navigation and tracking; see |26j for several examples. 

In this case, the ancestor sampling step is even more problematic than for 
the previous example, since the transition is truly degenerate. Indeed, if we use 
the ancestor sampling distribution from ([^j^ the probability of selecting xl_i 
as the ancestor of x'^ will be zero, unless x^ —^xl_i is in the column space of F. 
However, in general, this will almost surely not be the case, except for i = N. 

^Recall that we use density notation also for the degenerate kernel. 
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Thus, the distribution ^ puts all probability mass on i = N, resulting in zero 
probability of changing the ancestry of the reference trajectory. In fact, it is 
not only for PGAS the degeneracy of the transition kernel is problematic. As 
discussed in m Section 4.6], any conventional SMC-based forward-backward 


smoother will be inapplicable for the model (28) due to the degeneracy of the 
backward kernel. 

However, by collapsing over intermediate state variables, this problem can 


be circumvented. We assume that the pair (A, F) in (281 are controllable (see 
e.g., [28] for a definition). Informally, this means that any state in the state 
space is reachable from any other state, i.e. for any {x,x') G there exists an 
integer £ and a noise realisation Vf.t+i which takes the system from Xt-i = x 


at time t — 1 to Xt+i = x' at time t -\- £. Now, let St = (xt, 


J with 


Kt = min{T, t + £—1} and where the length £ is chosen as any integer (e.g., the 
smallest) such that the matrix 


Cl = [F AF 


A^F] 


is of rank n (the existence of such an integer is guaranteed by the controllability 
assumption). 

Assuming Kt < T (the case Kt = T follows analogously) we have that the 
unnormalised target PDF in (13) is given by 



) } P{x't 




a-t \ 
t-l)^ 


(29) 


where p(a;(_|_^,St |a;“li) corresponds to the prior distribution under the linear 
Gaussian dynamics (28a) of the state sequence St and the end-point con¬ 
ditionally on the starting point Even though this distribution is degenerate 

for the model (28), our choice of £ ensures that for any [xt-i^xt+i) G X^, there 
exists a St S in the support of the distribution. 

In particular, the conditional distribution p(St | xl!_i,x^_^() is a (degenerate) 
Gaussian distribution which it is possible to sample from. For instance, this 
can be done by running a Kalman filter/backward simulator [S] I22j for time 
steps t, t + £ — \ ioT the state process ( |28a| ), with = Axt+i-i + Fvt+i 
acting as an “observation” at the final time step (i.e., we condition on the ref¬ 
erence state via a standard measurement update of the Kalman filter). 
Gonsequently, it is possible to use this as a proposal distribution in the MH 
kernel (15) or in the CIS kernel (17) to simulate from (29). Indeed, in tak¬ 
ing the ratio between the target ([^ and the proposal p(S( | a:t+^) we 

have p{xt+i,'F‘t \ Xt-i)/p{'B.t \ xt-i,xt+i) = p{xt+i \ Xt-i) which is a well-defined 
Gaussian density for any {xt-i,Xt+i) G X^. Specifically, 


p{xt+i\xt-i) = N{xt+t\A^+'^Xt-i,CiCj), 


(30) 


which is non-degenerate under the controllability condition. 
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6 Near-degenerate approximations of intractable 
transitions 

Another class of SSMs which poses large inferential challenges are models with 
intractable transition density functions as explained under label (M2). Hence, 
consider an SSM on the form 0 and assume that the transition density function 
/(•) is a regular, non-degenerate PDF which it is possible to simulate from, but 
which is not available for evaluation in closed form. A problem with these mod¬ 
els is that the backward kernel ([^ is also intractable, essentially ruling out any 
forward-backward-based inference technique. In fact, one of the main merits 
of the PMCMC samplers derived in [3] is that, in their most basic implemen¬ 
tations, they only require forward simulation of the system dynamics. These 
methods—specihcally, PG and the particle independent Metropolis-Hastings 
(PIMH) sampler—can thus be readily used for inference in models with in¬ 
tractable transitions. However, these methods (PG and PIMH) are liable to 
poor mixing unless a large number of particles are used in the underlying SMC 
samplers (see, e.g., m)- Intuitively, the reason for this is that we require the 
SMC sampler to generate approximate draws from the full joint smoothing dis¬ 
tribution^ which is difficult using only forward-simulation due to path space 
degeneracy. 

As has been demonstrated (here and in the previous literature, e.g., [311130] '). 
the PGAS sampler will in many cases enjoy much better mixing than PG and 
PIMH, in particular when using few particles N relative to the number of ob¬ 
servations T. However, the PGAS sampler is not directly applicable to models 
with intractable transitions. Indeed, to simulate from the ancestor sampling dis¬ 


tribution (18) it is necessary to evaluate the (intractable) transition PDF /(•). 
In this section, we propose one way to address this limitation. The idea is to 
approximate, as detailed below, the intractable transition PDF with a nearly de¬ 
generate transition. Using the proposed particle rejuvenation procedure, much 
in the same way as discussed in Section]^ we can then enable PGAS for this 
challenging class of SSMs. 

The proposed method is essentially a variant of the approximate Bayesian 
computation (ABC) technique Om]. However, while ABC is typically used 
for inference in models with intractable likelihoods (see, e.g., (Tl] [Mj for SMC 
implementations), we use it here to address the issue of intractable transitions. 
The idea is based on the realisation that simulating from the transition PDF 
/(•), which is assumed to be feasible, is always done by generating some “driving 
noise variable” Vt, say, which is then propagated through some function F(-) 
(note that this function can be implicitly defined by a computer program or 
simulation-based software). By explicitly introducing these noise variables, we 
can thus rewrite the original model on the equivalent form. 


Vt ~ Pv{vt), 

Xt = r(a:t_i,Ut), 
Vt ~ g{yt\xt). 


(31a) 

(31b) 

(31c) 
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Note that Xt here is given by a deterministic mapping of Xt-i and Vt- Conse¬ 
quently, the transition function for the joint state (yt,xt) is given by 


p{vt,xt\vt_i,xt_i) = Pvivt)Sr(xt.^,vt){xt), (32) 


which is a degenerate transition kernel due to the Dirac measure on Xt- 


Remark 6. The reformulation given by © can be seen as transforming the 
difficulty of having an intractable transition, to that of having a degenerate 
one. Now, if we marginalise over xi-t (which is straightforward since Xi-t is 
deterministically given by vi-,t) we obtain a model specified only in the noise 
variables {vt}t>i- This approach has previously been used by Murray et al. (SS] 
in the context of auxiliary SMC sampling and for particle marginal Metropolis- 
Hastings. It was also used by Lindsten et al. m to enable inference by PGAS in 
a model with an intractable transition. The problem with that approach, how¬ 
ever, is that the marginalisation of the x^-process introduces a non-Markovian 
dependence in the observation likelihood, resulting in a computational com¬ 
plexity for PGAS (see [STj for details). 


Simply rewriting the model as in (31) does not solve the problem, since, 
as discussed above, the degenerate transition kernel is problematic when using 
PGAS. However, by making use of an ABC approach this issue can be ad¬ 
dressed. Specifically, we make use of a near-degenerate approximation of (32). 
In the ancestor sampling step of the algorithm we replace the point-mass dis¬ 
tribution by some (for instance, Gaussian) kernel Kj : i—>■ K+ centered on 

T{xt-i,vt): 


« Ke{T{xt-l,Vt),Xt), 


(33) 


where e controls the band-width of the kernel (and thus the approximation 
error). Next, to deal with the near-degeneracy of the approximation (for small e) 
we select Sj = Vt to be rejuvenated, which results in a joint target for (at,Vt), 
as in (13), proportional to 


WtL-^Py{vt)Ke{T{x'^L^,vt),Xt). 


(34) 


The connection to ABC is perhaps most easily seen if we make use of the 
CIS kernel given in Algorithm for simulating from this distribution. Let 
iptivt I At_i, S() = Pv{vt) be the proposal distribution for the noise variable (in 
many cases this is likely to be the only sensible choice) and let = w'jjti 
in We then obtain the following ancestor sampling procedure, using the 

convention a'^ = N (cf. Algorithm]^: 

• For i = I, ..., N — I: 

— Simulate 'd\ with P(aJ = j) oc wj_i. 

- Simulate a noise realisation u) ~ p„(-) and set x) = r(x“L^,u)). 

— Compute wl_i = Ate(x),x(). 
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• Compute 'wf_i = Ke{xf,x^). 

• Simulate £ with P(£ = i) oc wl_i, i = 1, ..., N. 


• If £ < N, return = af, otherwise return = N. 


In the above, we have assumed that the kernel approximation (33) is used only 


in the ancestor sampling step of the algorithm (i.e., in the forward simulation 
of particles we use the original model ( |31[ )). An interesting implication of this 
is that there is no need to explicitly keep track of the ut-variables. Indeed, 
the CIS procedure outlined above can be expressed in words as follows: (i) 
Generate an independent set of iV — 1 resampled particles at time t — 1 and, 
for each one, simulate the system dynamics forward to obtain {xl}^^^, (ii) set 
the final particle according to the conditioning x^ = x'^, and (in) simulate a 
new ancestor for based on the closeness of {x\}f^i to x[, as measured by the 
kernel 

Remark 7. In some cases, for instance if X is high-dimensional, it can be bene¬ 


ficial to define the kernel in (33) on some summary statistic S : Xt S{xt), 
rather than on the state variable itself; see, e.g., m for details. 


7 Numerical illustration 

In this section we illustrate the particle rejuvenation strategy for PGAS on two 
examples. We have already seen the merits of the approach when compared 
to standard PGAS (and PG) on a nearly degenerate target tracking model in 
Section [4A| Hence, here we consider two alternative model classes, first a model 
with a linear Gaussian degenerate transition as discussed in Section [5.3| and then 
a model with an intractable transition as discussed in Section [6l 

7.1 Degenerate transition model 

Autoregressive models are widely used to model stochastic processes [23] . They 
may be written in state space form as a degenerate Gaussian transition model. 


Xt+l = 


ai 

OL2 

O^n— 1 

O^n 


(7 y 

1 

0 . 

0 

0 


0 

0 

1 . 

0 

0 

Xt + 

0 

_ 0 

0 . 

1 

0 _ 


_ 0 _ 


Vt+l, 


=A 


= F 


(35) 


where a = (oi ••• is a vector of regression parameters, {r’t}t>i is 

(scalar) white Gaussian noise, and dy is the process noise standard devia¬ 
tion. In this example, we model the latent state as an autoregressive pro¬ 
cess of order n = 5. We simulate the system for T = 500 time steps using 
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Figure 3: Posterior histograms for PGAS with iV = 20 and £ = 4 rejuvenated 
states (blue bars) and for PG with N = 500 (pink asterisks) for a randomly 
chosen state, 


a = (0.9 —0.8 0.7 —0.6 0.5)^ and cr« = 1. Each observation is a noisy, 
saturated measurement of the first component of Xt, modelled as. 


yt= P ^tanh(/3a;i,t) + CTeSt, 


(36) 


where Ct is t-distributed with v = i degrees of freedom. We set f3 = 0.5 and 
CTe = 0.5. 

Since rank(FF'^) = 1 < n, the transition kernel is degenerate. Gonse- 
quently, standard ancestor sampling is ineffective, and PGAS without rejuve¬ 
nation is equivalent to basic PG. However, by collapsing over states St = 
(xf, ..., xt+t-i) using £ > 4, it is possible to update the particle ancestry. We 
use the CIS Markov kernel with new ancestor indexes sampled proportional to 
the filter weights, and state sequences sampled according to p(St \ xt-i,xt+i)- 
The resulting CIS weights (see Algorithm Step are then, 

(t+i-i 'j 

^ I n I I p(^t+e I (37) 


where p(xt+^ | Xt-i) is a well-defined Gaussian density given by (30). 

We run the PGAS sampler with £ = 4 rejuvenated states and with JV = 20 
particles. To check that the sampler indeed converges to the correct posterior 
distribution we also run a PG sampler with TV = 500 particles (this value was 
chosen by trial-and-error as the smallest number required by PG to still have 
reasonable mixing). In Figure]^ we plot the histograms for the two samplers for 
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Figure 4: Empirical autocorrelations for the state variables for PGAS 

with TV = 20 and £ = 4 rejuvenated states (top) and for PG with N = 500 
(bottom). The median (thick line), 5-95 percentile (shaded area), and min/max- 
values (lighter shaded area) over the 500 state variables are reported. 


a randomly chosen state variable, 211 , 397 . As can be seen, there is a close match 
between the posterior histograms. 

We also compute the empirical autocorrelation functions for both samplers 
for all state variables {a;i,t}(£° . The results are reported in Figure]^ Despite the 
fact that it uses much fewer particles, the mixing speed of PGAS is significantly 
better than for PG. This is in agreement with previous results reported in 
the literature |31j . Indeed, the current example should mainly be seen as an 
illustration of how particle rejuvenation opens up for using backward-sampling- 
based methods, in particular PGAS, for a model where that would otherwise 
not be possible. 

7.2 Intractable transition model 

We now turn to a model with an intractable transition density to illustrate the 
ABC approximation for the PGAS sampler presented in Section]^ We consider 
inference in a stochastic version of the Lorenz ’63 model [33], given by the 
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following SDE: 
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0 

0 

as_ 


(38) 


where Wr is a three-dimensional Wiener process and the system parameters are 
cr = 10, p = 28, /3 = 8/3 and ctq = era = as = v^. The state is observed indi¬ 
rectly through noisy observations of the Q-component at regular time intervals: 
yt 1) with At = 0.01. The initial state is distributed according to 

(Qoj -Rqj 5 'o)^ ~ Af (0, 13 ). 

A system simulator is implemented based on a fine-grid Milstein discretisa¬ 
tion [ 33 . While the Milstein density for a single discretisation step is available 
[30] . it is intractable to integrate out the intermediate steps on the grid. Conse¬ 
quently, the employed simulator lacks a closed form transition density function 
and, indeed, for the purpose of this illustration it is viewed simply as a “black¬ 
box” simulator. 

We simulate the system for r € [0,10] and thus generate T = 1 000 obser¬ 
vations ( 2 / 1 , ..., j/t)- We then run PGAS with particle rejuvenation and the 
ABC approach outlined in Section to compute the posterior distribution of 
the system state at the observation time points. The method uses N = 100 
particles and a Gaussian kernel for the ABC approximation: 


Ke{x, x') = exp 




We let the kernel bandwidth range from e = 0.01toe = 10. As comparison, we 
also run both the PC and PIMH samplers from [3] with the number of particles 
N ranging from 200 to 10 000 (the computational cost per iteration is roughly 
the same for PGAS with N = 100 as for PG/PIMH with N = 200, as the main 
computational cost comes from the system simulator). 

RMSEs for the posterior means of the system states are shown in Eigure [5p] 
The bias coming from the ABC approximation is evident for large e, as the 
RMSEs level out at a non-zero value (if no approximation were made we would 
expect that the RMSE goes to zerc0 as the number of MCMC iterations in¬ 
creases). Nevertheless, comparing the results for PGAS to those obtained for 
PG and PIMH, it is evident that the ABC bias is signihcantly smaller than the 
Monte Carlo errors resulting from the poor mixing of PG and PIMH (at least 
if e is not overly large). 

As e decreases the bias diminishes, but at the expense of slower convergence 
of PGAS. The reason for this is that the probability of updating the ancestry 
decreases with e. In fact, for e = 0 the bias is completely removed, but we will 
then have zero probability of changing the ancestry and PGAS will be equivalent 
to PG (and thus suffer from the same poor convergence speed). Gomparing the 

^The “ground truth” is computed as an importance sampling estimator based on 10 000 
independent particle filters, each one using = 50 000 particles. 

®At least up to the accuracy of the “ground truth” reference sampler. 
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PGAS, N= 100 





Figure 5: RMSEs in the estimated posterior mean E[a;i: 7 ’ | yi-.r] for the Lorenz 
’63 model for PGAS using N = 100 (top), PG (bottom left), and PIMH (bottom 
right). (This figure is best viewed in color.) 
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results for PGAS using e = 0.01 with PG, however, we see that just having 
a small chance of updating the ancestry can have a significant impact on the 
mixing speed. For such a small value of e the ABC bias is clearly dominated 
by the variance, even after 24 hours of simulation, corresponding to roughly 
100 000 MCMC iterations. 


8 Discussion 


The particle rejuvenation technique presented in this paper generalises existing 
backward-simulation-based methods and opens up for a high degree of flexibility 
when implementing these procedures. This flexibility has been shown to be 
crucial for obtaining efficient samplers for several challenging types of state 
space models with (nearly) degenerate and/or intractable transitions. However, 
the technique is more generally applicable and we believe that it can be useful 
also for other types of models. In fact, we have recently made use of the particle 
rejuvenation technique in a completely different setting, namely to prove the 
validity of the nested SMC algorithm presented in m- To further investigate 
the scope and usefulness of the particle rejuvenation technique in other contexts 
is a topic for future work. 

Our main focus in this paper has been on state smoothing (or, more generally, 
inference for a latent stochastic process). However, one of the main strengths 
of PMCMC samplers, such as the PGAS algorithm that we have used as the 
basis for the presented technique, is that they can be used for joint state and 
parameter inference. For PGAS, this is typically done by implementing a two- 
stage Gibbs sampler by iterating: 


1. Simulate the parameter 9 from its full conditional given the states Xt and 
observations Yt- 

2. Simulate the states Xt from the PGAS Markov kernel, conditionally on 
9 and Yt- 

This approach can be used also with the proposed Algorithm to obtain a 
valid MCMC sampler for the joint posterior p{9,Xt \ Yt). Indeed, this is the 
method that we used to sample the parameter 9 in the illustrative example in 
Section |4.1| However, it is worth pointing out that this approach might not 
always be successful for the challenging model classes (Ml) and (M2) that 
have largely motivated the present development. The problem is that for these 
models it could be infeasible to simulate 9 from its full conditional in Step 
of the aforementioned Gibbs sampler, since the degeneracy or intractability of 
the transition density could be inherited by full conditional distribution of 0. In 
such scenarios, we thus need a different way for enabling the use of Algorithmic 
for parameter inference. We mention here two possible, albeit as of yet untested, 
approaches. 

Firstly, even if the model is degenerate or intractable, it is typically possible 
to explicitly introduce the “noise variables” {vt}f^i that drives the state transi¬ 
tion; see (31). We can then design a Gibbs (or Metropolis-within-Gibbs) sampler 
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for the extended model, with the original state variables Xt marginalised out. 
Note that we can still use Algorithmto simulate Xt, but then transform the 
states to when updating 9. This overcomes the prohibitive 0{T^) com¬ 

putational complexity associated with using PGAS for simulating the driving 
noise variables directly, as discussed in m- 

Secondly, it is possible to couple Algorithm with the particle marginal 
Metropolis-Hastings (PMMH) algorithm by Andrieu et al. [3]. PMMH simulates 
{9,Xt) jointly and implements a Metropolis-Hastings accept/reject step based 
on an estimate of the data likelihood computed by running a (forward-in-time 
only) SMC sampler. A problem with PMMH, however, is that the method 
tends to get “stuck”, due to occasional overestimation of the likelihood. We 
believe that the method proposed in this paper can be used to mitigate this 
issue. Indeed, it is possible to use Algorithmj^to refresh the likelihood estimate 
used in PMMH, while still maintaining the correct limiting distribution of the 
sampler (the details are omitted for brevity). By occasionally refreshing the 
likelihood in this way, it may thus be possible to escape the sticky states with 
overestimated likelihoods that deteriorate the practical performance of PMMH. 

Investigating the effectiveness of these approaches, as well as enabling pa¬ 
rameter inference in models of types (Ml) and (M2) by using the method 
presented in Algorithm in a more direct sense, are topics for future work. 

Another interesting and important direction for future work is to analyse 
the effect of the ABC approximation (33). It was found empirically in m that 
PCAS appears to be robust to approximation errors in the ancestor sampling 
weights, and this is in agreements with our findings reported in Section |7.2[ 
However, a more theoretical analysis is called for to understand if the sampler 
affected by the ABC approximation still admits a limiting distribution and, if 
so, how this distribution is affected by the approximation error. 


A Partially collapsed Gibbs sampler 

The original PCAS method, reviewed in Algorithm corresponds to the fol¬ 
lowing partially collapsed Cil^s sampler for the extended^ target distribution 
(flol); see [3l]: Given At = ATi = e and Bt = Bi = {N, ..., N) € 

(i) Draw ~ tt^ (• | Ai, Hi), 

(ii) For t = 2 to T, draw: 

(a) (x.-'Sa,-"*) ^ <(• I Xt_i, Ai_i, Ai,.gt_i), 

(b) a^ ~^^(-|X*_i,At_i,At,Ht), 

(iii) Draw fc ~ TTy (• I Xt, At). 

Similarly, the proposed PGAS algorithm with particle rejuvenation presented 
in Algorithm corresponds to the following partially collapsed Gibbs sampler 
for 
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(i) Draw Si ~ 7r^( • | 

(ii) Draw ~ TTy (• I Xi, _Bi), 

(in) For i = 2 to T, draw: 

(a) (x-''‘,a-''‘) ^ I Xt_i, Ai_i,Xi,Bt_i), 

(b) (a^, Si) ~ (• I Xi_i, Ai_i, Si, 5i), 

(iv) Draw fc ~ TTy (• I Xt, At). 

More precisely, Si and (ai*,St) are sampled from the Markov kernels Ki and 
At, respectively, in Steps (i) and (iii-b). However, since these Markov kernels 
are constructed to leave the corresponding conditional distributions invariant, 
this corresponds to a standard composition of MCMC kernels. The fact that 
the Gibbs sampler outlined above is properly collapsed, and thus leaves tt^ 
invariant, follows by analogous arguments as in the proof of [HI Theorem 1]. 
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